|
The bitap algorithm (also known as the shift-or, shift-and or Baeza-Yates–Gonnet algorithm) is an approximate string matching algorithm. The algorithm tells whether a given text contains a substring which is "approximately equal" to a given pattern, where approximate equality is defined in terms of Levenshtein distance — if the substring and pattern are within a given distance ''k'' of each other, then the algorithm considers them equal. The algorithm begins by precomputing a set of bitmasks containing one bit for each element of the pattern. Then it is able to do most of the work with bitwise operations, which are extremely fast. The bitap algorithm is perhaps best known as one of the underlying algorithms of the Unix utility agrep, written by Udi Manber, Sun Wu, and Burra Gopal. Manber and Wu's original paper gives extensions of the algorithm to deal with fuzzy matching of general regular expressions. Due to the data structures required by the algorithm, it performs best on patterns less than a constant length (typically the word length of the machine in question), and also prefers inputs over a small alphabet. Once it has been implemented for a given alphabet and word length ''m'', however, its running time is completely predictable — it runs in O(''mn'') operations, no matter the structure of the text or the pattern. The bitap algorithm for exact string searching was invented by Bálint Dömölki in 1964 and extended by R. K. Shyamasundar in 1977, before being reinvented in the context of fuzzy string searching by Manber and Wu in 1991 based on work done by Ricardo Baeza-Yates and Gaston Gonnet. The algorithm was improved by Baeza-Yates and Navarro in 1996 and later by Gene Myers for long patterns in 1998. ==Exact searching== The bitap algorithm for exact string searching, in full generality, looks like this in pseudocode: algorithm bitap_search(text : string, pattern : string) returns string m := length(pattern) if m == 0 return text / * Initialize the bit array R. */ R := new array() of bit, initially all 0 R() = 1 for i = 0; i < length(text); i += 1: / * Update the bit array. */ for k = m; k >= 1; k -= 1: R() = R() & (text() == pattern()) if R(): return (text+i - m) + 1 return nil Bitap distinguishes itself from other well-known string searching algorithms in its natural mapping onto simple bitwise operations, as in the following modification of the above program. Notice that in this implementation, counterintuitively, each bit with value zero indicates a match, and each bit with value 1 indicates a non-match. The same algorithm can be written with the intuitive semantics for 0 and 1, but in that case we must introduce another instruction into the inner loop to set R |= 1 . In this implementation, we take advantage of the fact that left-shifting a value shifts in zeros on the right, which is precisely the behavior we need.Notice also that we require CHAR_MAX additional bitmasks in order to convert the (text() == pattern()) condition in the general implementation into bitwise operations. Therefore, the bitap algorithm performs better when applied to inputs over smaller alphabets.抄文引用元・出典: フリー百科事典『 ウィキペディア(Wikipedia)』 ■ウィキペディアで「Bitap algorithm」の詳細全文を読む スポンサード リンク
|